Chapter 3, Part 4: Neural Networks

Machine Learning

Author

F.San Segundo & N.Rodríguez

Published

February 2026


Hide the code
import torch

def get_best_device():
    if torch.backends.mps.is_available():
        device = torch.device("mps")
        backend = "Metal (MPS)"
    elif torch.cuda.is_available():
        device = torch.device("cuda")
        backend = f"CUDA ({torch.cuda.get_device_name(0)})"
        # import GPUtil
        # gpus = GPUtil.getAvailable(order='memory', limit=1)
        # print(f"GPU Memory Free: {GPUtil.getGPUs()[0].memoryFree} MB")
    else:
        device = torch.device("cpu")
        backend = "CPU"
    
    print(f"Using backend: {backend}")
    return device

device = get_best_device()
print(f"Device: {device}")

#Manual override for testing
device = torch.device("cpu")
Using backend: CPU
Device: cpu
Setting the working directory

We begin by using cd to make sure we are in the right folder.

Hide the code
if device.type == "cpu":
    %cd ./3_4_NeuralNetworks
[Errno 2] No such file or directory: './3_4_NeuralNetworks'
/wd/3_4_NeuralNetworks

Introduction to Neural Networks

Hide the code
# Required imports

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# import string

import pickle

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_circles
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

#import joblib
import warnings

import networkx as nx

Introduction to Artificial Neural Networks

Artificial Neurons

An artificial neuron is a computational unit that performs a quite simple mathematical operation; they are usually illustrated with diagrams such as the picture below (this one is Figure 13-2 from (Glassner 2021)). The computation is represented as a graph where the operations proceed left to right:

  • The artificial neuron takes a series of numerical values as inputs (the leftmost arrows entering the graph). Let us represent these inputs as \(s_1, s_2, \ldots, s_k\).
  • Each input is multiplied by a certain weight \(w_1, w_2, \ldots, w_k\) associated with the arrow corresponding to that input and the products are used to form a linear combination: \[w_1s_1 + w_2s_2 + \cdots + w_ks_k\] This step is indicated in the diagram by the plus sign in a circle.
  • A bias term \(w_0\) is added to the linear combination: \[w_0 + w_1s_1 + w_2s_2 + \cdots + w_ks_k\] To simplify the discussion and notation we will normally use the bias trick, adding an extra input with fixed value 1 so that the inputs are \(1, s_1, s_2, \ldots, s_k\) and then the bias can be included into the list of weights.
  • Finally, the linear combination goes into an activation function \(h\), represented by the step function in the righmost node of the diagram. We will later discuss in detail what activation functions are used in practice, but the resulting value, the output of the artificial neuron will always be \[h(w_0 + w_1s_1 + w_2s_2 + \cdots + w_ks_k)\]

Artificial neurons were introduced in the 1940s (see (Fitch 1944)) and captured the interest of the research community up to the late 1960s. They were conceived in an attempt to provide an abstract model of the operations performed by the neurons in the animal brains. Of course, a biological neuron is a far more sophisticated system with a complex regulation composed of chemical and electrical signals. To say that the artificial neuron is a simplistic model of the biological reality is a huge understatement. But the analogy sticked, and up to the present these computational units are still referred to as artificial neurons. You can learn more about the first years of artificial intelligence in this video and in the book (Hecht-Nielsen 1990). See also the first pages of the lecture notes from past year and the references provided therein.


Artificial Neural Networks (ANNs)

Let us think about the impact of the choice of activation function on the class of problems that can be addressed by an artificial neuron. If the activation function is the identity, meaning

\(h(x) = x,\)

then the artificial neuron is nothing but a linear regression model. If it is a sigmoidal function, as in

\(h(x) = \frac{1}{1 + e^{-x}}\)

then the artificial neuron model is simply a logistic regression model. In any case, for any choice of activation function, a single neuron model can only be used for linearly separable problems.

Things change when we consider not a single neuron, but networks of several, possibly many, interconnected neurons. This is the idea of the Artificial Neural Networks (ANNs). These are models composed of artifical neurons as building blocks, whose architecture can be described by a (directed) graph of nodes and arrows that indicate how the computations (for training and prediction) are performed using the model. One such graph appears below (figure 13-7 in (Glassner 2021))

As you can see, the output of any neuron in the network can be used as the input to one or several other neurons in the network, downstream in the direction indicated by the arrows connecting the neurons. Note also some simplifications that are common in this type of ANN diagrams: + We do not include the weights associated with each input. Every arrow entering a neuron is assumed to carry an implicit weight along with it. + The situation for the bias is even more extreme: the bias terms do not appear, not even as an arrow implying their presence. But they are there and you need to keep in mind that every neuron includes its own bias term.

There is in principle no limitation to the topology or architecture of the ANN; that is, the number and ways in which neurons are connected to each other. However, already in the early days of AI it was realized that sticking to some particular families of ANN architectures greatly simplified their understanding and their use.


Layers of Neurons and the Multilayer Perceptron.

The fundamental (but not the unique) building block of today’s ANNs architectures is the layer of neurons. And the simplest and historically the first example of the use of this concept is the multilayer perceptron network, like the one in the picture below (figure 13-11 in (Glassner 2021)):

A multilayer perceptron network has:

  • One input layer, which simply contains the (vector of) values of the input data set.
  • One or more hidden layers of neurons. In modern ANNs there are a number of different types of hidden layers. But for the multilayer perceptron all the hidden layers are of the same type. They are dense (or fully connected) layers, meaning that each neuron in layer \(i\) is connected with every neuron in layer \(i+1\)
  • And an output layer. The structure of the output layer depends of the nature of the problem that the network is intended to solve. For a regression problem the output layer has a single neuron, that returns the predicted output value for the inputs we have used. In a binary classification problem the output layer will again have a single neuron, but the value is now interpreted as a the score of the positive class, as in other classfication algorithms we have seen. In a multiclass problem the output layer will have more neurons, each one producing the score corresponding to a given class (level) of the output factor.

We are going to discuss in detail all of these components below.

The multilayer perceptron is a particular type of feedforward neural network The term feedforward indicates that the information flows in one direction, from the input to the output (left to right in the above diagram), without any feedback loops or recurrent connections. The multilayer perceptrons are historically the first examples of neural networks. But there are other important members of the feedforward family, such as the radial basis function (RBF) networks, which are related to the gaussian mixture models we have seen. Also the convolutional neural networks (CNN) are technically feedforward networks as well.

Exercise 001

Reflections about the ANN architecture and the number of parameters: + How many parameters (weights and bias terms) does the network in the picture contain? + Now change the architecture to a single hidden layer, but keep the total number of neurons in the hidden layers constant. How does this affect the number of parameters of the network?


The Collapse of the Network and Activation Functions.

Recall that the output of an artificial neuron with inputs \((1, s_1, \ldots, s_k)\) and weights \((1, s_1, \ldots, s_k)\) (note the use of the bias trick) is given by \[h(w_0 + w_1s_1 + w_2s_2 + \cdots + w_ks_k)\] If the activation function is a linear function, the result is still a linear (more precisely affine) function of the neuron inputs. And extending that to the ANN, if all the activation functions in the hidden layers of the network are linear, then the ANN can only address linearly separable problems. This result holds irrespective of the number of layers, the number of neurons of each layer and, immportanly, the activation function of the output layer. This situation is described as the collapse of the network.

To summarize the idea, if we want our ANNs to go beyond linear models we need two things:
1. The network needs to include hidden layers (at least one). 2. The neurons in the hidden layers need to use non-linear activation functions.

If that is the case then it can be formally shown (through a mathematical proof) that a network with even a single hidden layer but with sufficiently many neurons can be used to approximate any continuous function. That is, multilayer perceptrons with non linear activation functions are universal function approximators (see e.g. (Hornik, Stinchcombe, and White 1990) and the references therein).

Linear Activation Functions are Used and Useful!

The above does not mean that linear activation functions are never used. For example, for a regression problem the activation function of the output layer is usually a linear function, in fact the identity \(h(x) = x\), is often used. There are also some more technical uses for linear activations in the hidden layers of neural networks.


Non Linear Activation Functions.

Let us review some of the best known and most widely used non linear activation functions.

Sigmoid Activation Function

This is a function that we already know from our work with logistic regression: \[\sigma(x) = \dfrac{1}{1 + e^{-x}}\] Let us use Python to plot it:

Hide the code
x = np.linspace(-15, 15, 100)
fig, ax = plt.subplots(figsize = (8, 3))
sns.lineplot(x = x, y = 1/(1 + np.exp(-x)))

Hyperbolic Tangent Activation Function

This function is similar in shape to the logistic curve, but it constrains the values to be in the \([-1, 1]\) interval (instead of the \([-1, 1]\) used by the logistic): \[\tanh(x) = \dfrac{e^{x} + e^{-x}}{e^{x} - e^{-x}}\]

Hide the code
fig, ax = plt.subplots(figsize = (8, 3))
sns.lineplot(x = x, y = np.tanh(x))

RELU (Rectified Linear) Activation Function

Probnly the most popular activation function nowadays, the RELU functions is defined as: \[ RELU(x) = \begin{cases} x &\text{ if }x \geq 0\\[3mm] 0 &\text{ if }x < 0\\ \end{cases} \] which means that the plot is:

Hide the code
def relu(x):
    return np.array([max(0, u) for u in x])

fig, ax = plt.subplots(figsize = (8, 3))
sns.lineplot(x = x, y = relu(x))

As you can see this function is not linear, but piecewise linear. However, this very basic kind of non linearity is enough to prevent the network collapse. And on the other hand the computation of the RELU value is a simple if statement, whereas the previous activation functions are computationally much more expensive. This is one of the reasons why RELU has become so popular in recent years.


The AI Winters and the 2020s AI Spring

Some theoretical misunderstandings related to the network collapse phenomenon combined with the limitations of the hardware led to what was called the first AI winter and research stalled for a long time. Another factor contributing to that decrease of interest in AI was of an algorithmic nature. Some pioneering researchers already knew that multilayer models and non linear activations would solve the issue. But at the time it was not known how to train a multilayer model. A new fundamental algorithm, called backpropagation was needed for that.

In the meantime, a second era of AI methods appeared in the form of so called expert systems in the 1980s. They relied on a completely different methodology: rule-based systems. Those are built on a set of explicit rules or if-then statements that try to encode the knowledge and expertise of human experts. In particular they did not use neural networks. But they turned out to be too hard to maintain, and the expectations, hype and investments soon vanished. The second AI winter and the bursting of the dot-com bubble (driven by speculative investment in internet-related companies during the late 1990s) occurred around the same time period. Thus, for a long time, the AI name vanished from the research vocabulary. But the neural networks researchers had kept working and in 1986 the backpropagation algorithm became popular due to a paper in Nature. It had some success, but it was not until the 2010s, when comparatively cheap yet powerful GPUs became available, that backprop (as it is usually called) found the adequate hardware to unleash its parallel computing capabilities. This has lead to what is sometimes called the current AI spring in the 2020s, and the popularization of some AI products such as LLMs. You can get a privileged perspective on that period by watching this interview with Geoffrey Hinton by Andrew Ng from deeplearning.ai.


The Output Layer.

The output layer of a neural network behaves differently depending on the nature of the problem.

  • Regression: If we are dealing with a regression problem then the output layer has a single neuron. This neuron will take as inputs the values coming from the last hidden layer, let us call them

    \(A_1,\ldots, A_k\)

    and then the output layer will use its set of weights \(w_1, \ldots, w_k\) and bias term to output the affine combination:

    \(w_0 + w_1 A_1 +\cdots + w_k A_k\)

    This is the final prediction of the model. You may wonder about the activation function for this output layer, but the above is equivalent to using the identity \(h(x) = x\) as activation, as we mentioned before.

  • Binary Classification: In this case we want to use the output as a score for the positive class of the output variable. In particular we only need the neural network to predict a aingle number. Therefore in this case again the output layer contains a single neuron. The only difference with the previous regression case is that now the activation function is not the identity but the sigmoid curve we have introduced above: \(\sigma(x) = \dfrac{1}{1 + e^{-x}}\)

  • Multiclass Classification and Softmax Function: This is a different scenario, because of the output variable is a factor with \(p\) levels, then the network needs to predict scores for each of them. And so in this case the output layer contains \(p\) neurons. The outputs \(u_1, u_2, \ldots, u_p\) of these neurons are then typically converted into a new set of numbers \(p_1, p_2,\ldots, p_k\) by using a softmax transformation defined by: \(p_1 = \dfrac{e^{u_1}}{e^{u_1} + e^{u_2} + \cdots + e^{u_p}},\quad p_2 = \dfrac{e^{u_2}}{e^{u_1} + e^{u_2} + \cdots + e^{u_p}},\quad \ldots \quad , \quad p_k = \dfrac{e^{u_k}}{e^{u_1} + e^{u_2} + \cdots + e^{u_p}}\)

    This transformation clearly implies that the numbers \(p_i\) verify

    \(0\leq p1, p_2, \ldots, p_k\leq 1\qquad\text{and}\qquad p_1 + p_2 + \cdots + p_k = 1\)

    and so they are easy to interpret as (usually uncalibrated) probabilities for the \(k\) output factor classes.


Multilayer Perceptron Examples in scikit-learn

A Simple Example for a Binary Classification Problem

Now that we have an initial idea of the architecture of the MLP, we will begin by using sklearn to train a neural network (a MLP in this example) for a binary classification example that we have already used in previous sessions. Keep in mind that sklearn provides implementations only for the most classical and basic ANN architectures, which are precisely the MLPs. For more sophisticated architectures (and to get the most of the hardware) we need to move our work to deep learning libraries such as Tensorflow/Keras or PyTorch.

Let us begin by recreating the example dataset. To make things more interesting we select an example where a linear boundary is clearly not sufficient.

Hide the code
X, Y = make_circles(n_samples=1000, factor=.2, noise=0.15, random_state=2024)

inputs = ["X" + str(k) for k in range(X.shape[1])]
output = "Y"

XTR, XTS, YTR, YTS = train_test_split(X, Y,
                                      test_size=0.2,  # percentage preserved as test data
                                      random_state=1, # seed for replication
                                      stratify = Y)   # Preserves distribution of y


scaler = StandardScaler()
XTR = scaler.fit_transform(XTR)
XTS = scaler.transform(XTS) 



dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS

fig, ax = plt.subplots(figsize = (5, 5))
sns.scatterplot(dfTR, x=inputs[0], y=inputs[1], hue=output, ax=ax)
plt.show()

Next we use the MLPClassifier class in sklearn to train a MLP. To define the model we need to describe the architecture of the neural network. In the case of MLPs, this amounts to defining the number of hidden layers and the number of neurons they contain. For this example we are going to use two hidden layers, and we describe them by using a tuple with the number of neurons in each layer.

Then we also need to select the activation function used by those hidden layers. The MLP implementation in sklearn only allows for chosing a fixed type of activation, that will be applied to all hidden layers. The activation for the output layer is then automatically selected depending on the output variable (a sigmoid for this binary classification problem).

Another important parameter is the number of fitting iterations. We have not discussed yet how we fit the neural network, but it should come as no surprise that this is an iterative process. If the default number of iterations is not enough to deliver convergence, we try to increase it.

Hide the code
# Create and train the MLPClassifier
mlp_binary = MLPClassifier(hidden_layer_sizes=(8, 6), activation="relu", max_iter=10000, random_state=2024)
mlp_binary.fit(XTR, YTR)

model = mlp_binary
model_name = "mlp_binary"

Let us also save the model as a pickle object.

Hide the code
make_circles_model = {'mlp_binary':mlp_binary}

# with open('make_circles_model.pkl', 'wb') as file:
#     pickle.dump(make_circles_model, file)

The fitted model object contains information about the MLP weights and bias terms in the coefs_ and intercepts_ properties. The coefs_ is a list of arrays, describing the weights that connect each layer to the preceding one. We can inspect their shapes with

Hide the code
[layer.shape for layer in model.coefs_]
[(2, 8), (8, 6), (6, 1)]

The first describes the 16 weights connecting the two inputs (point coordinates) to each of the eight neurons of the first hidden layer. Then the 48 weights connecting the two hidden layers are contained in

Hide the code
model.coefs_[0]
array([[ 1.14338008,  0.20405062, -0.61077927, -0.62307275, -1.23623175,
        -0.55252571,  1.02855366,  0.18325872],
       [ 0.50417409, -0.01189684, -0.627055  ,  0.65451109, -0.99809363,
         1.63383212, -0.40492698,  0.06554955]])

And finally there are six weights connecting the neurons of the second hidden layer to the single neuron in the outout layer (for this binary classification problem).

Exercise 002
  • Inspect the intercepts_ and make sure that you understand their meaning.
  • This should be very easy now: how many parameters (weights and bias terms) does the network in this model contain?

Visualizing the MLP

Execute the next cell to run the external script containing the mlp_draw function. We provide this function (that uses the networkx library) to make it easier to visualize small MLP models trained with sklearn.

Hide the code
%run -i mlp_draw.py

Now we only need to feed the fitted model to function. The orientation,figsize, label_pos and label_size allow some customization of the result. Check the architecture of the network in the plot, and see if you can recognize some of the weights that we saw before.

Hide the code
mlp_draw(model=model, orientation="h", label_pos=2.3/7, add_labels=True)

Another interesting visualization is the decision boundary of this binary classifier that clearly shows that it has learned a nonlinear solution.

Hide the code
disp = DecisionBoundaryDisplay.from_estimator(model, XTR, response_method="predict", cmap="Greens", alpha = 0.5)
sns.scatterplot(dfTR, x=inputs[0], y=inputs[1], hue=output)
plt.show()

Exercise 003

Change the activation function to identity instead of relu (read the sklearn documentation to understand what we are doing). Then refit the MLP and repeat the boundary decision plot. What happens?

Let us now evaluate the classifier performance, using our standard toolset:

Hide the code
# Dataset for Training Predictions
dfTR_eval = dfTR.copy()
# Store the actual predictions
newCol = 'Y_'+ model_name +'_prob_neg'; 
dfTR_eval[newCol] = model.predict_proba(XTR)[:, 0]
newCol = 'Y_'+ model_name +'_prob_pos'; 
dfTR_eval[newCol] = model.predict_proba(XTR)[:, 1]
newCol = 'Y_'+ model_name +'_pred'; 
dfTR_eval[newCol] = model.predict(XTR)
Hide the code
# Test predictions dataset
dfTS_eval = dfTS.copy()
newCol = 'Y_'+ model_name +'_prob_neg'; 
dfTS_eval[newCol] = model.predict_proba(XTS)[:, 0]
newCol = 'Y_'+ model_name +'_prob_pos'; 
dfTS_eval[newCol] = model.predict_proba(XTS)[:, 1]
newCol = 'Y_'+ model_name +'_pred'; 
dfTS_eval[newCol] = model.predict(XTS)

This generates the confusion matrices and shows that classification is almost perfect both in training and test.

Hide the code
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
fig = plt.figure(constrained_layout=True, figsize=(6, 2))
spec = fig.add_gridspec(1, 3)
ax1 = fig.add_subplot(spec[0, 0]);ax1.set_title('Training'); ax1.grid(False)
ax2 = fig.add_subplot(spec[0, 2]);ax2.set_title('Test'); ax2.grid(False)
ConfusionMatrixDisplay.from_estimator(model, XTR, YTR, cmap="Greens", colorbar=False, ax=ax1, labels=[1, 0])
ConfusionMatrixDisplay.from_estimator(model, XTS, YTS, cmap="Greens", colorbar=False, ax=ax2, labels=[1, 0])
plt.suptitle("Confusion Matrices for "+ model_name)
plt.show(); 

The ROC curves tell a similar story. We satisfy ourselves that this looks good enough, and will not spend more time on performance measures. Feel free to dig deeper, you have all the tools from previous sessions.

Hide the code
from sklearn.metrics import RocCurveDisplay
fig = plt.figure(figsize=(12, 4))
spec = fig.add_gridspec(1, 2)
ax1 = fig.add_subplot(spec[0, 0]);ax1.set_title('Training')
ax2 = fig.add_subplot(spec[0, 1]);ax2.set_title('Test')
RocCurveDisplay.from_estimator(model, XTR, YTR, plot_chance_level=True, ax=ax1)
RocCurveDisplay.from_estimator(model, XTS, YTS, plot_chance_level=True, ax=ax2);
plt.suptitle("ROC Curves for "+ model_name)
plt.show(); 


Using Torch (pytorch) to fit the neural network

Hide the code
XTR.mean(axis=0), XTR.std(axis=0)
(array([2.22044605e-18, 4.47385184e-17]), array([1., 1.]))
Hide the code
import torch

XTR_tensors =  torch.from_numpy(XTR)
YTR_tensors =  torch.from_numpy(YTR)

XTS_tensors =  torch.from_numpy(XTS)
YTS_tensors =  torch.from_numpy(YTS)
Hide the code
# Set XTR to be float32
XTR_tensors = XTR_tensors.to(torch.float32)
YTR_tensors = YTR_tensors.to(torch.float32)
XTS_tensors = XTS_tensors.to(torch.float32)
YTS_tensors = YTS_tensors.to(torch.float32)
Hide the code
import torch
import torch.nn as nn

# Explicitly setting the seed for reproducibility (random_state=2024)
torch.manual_seed(2024)

# Defining the architecture
torch_mlp = nn.Sequential(
    # Layer 1: 2 inputs -> 8 neurons
    nn.Linear(2, 8),
    nn.ReLU(),
    
    # Layer 2: 8 neurons -> 6 neurons
    nn.Linear(8, 6),
    nn.ReLU(),
    
    # Layer 3: 6 neurons -> 1 output
    nn.Linear(6, 1),
    nn.Sigmoid()
)

print(torch_mlp)
Sequential(
  (0): Linear(in_features=2, out_features=8, bias=True)
  (1): ReLU()
  (2): Linear(in_features=8, out_features=6, bias=True)
  (3): ReLU()
  (4): Linear(in_features=6, out_features=1, bias=True)
  (5): Sigmoid()
)

Equivalently, we can add layers in an iterative process:

Hide the code
fc1 = nn.Linear(2, 8)
ac1 = nn.ReLU()
    
fc2 = nn.Linear(8, 6)
ac2 = nn.ReLU()
    
out = nn.Linear(6, 1)   
ac_out = nn.Sigmoid()

torch_mlp = nn.Sequential()
torch_mlp.add_module('fc1', fc1)
torch_mlp.add_module('ac1', ac1)
torch_mlp.add_module('fc2', fc2)
torch_mlp.add_module('ac2', ac2)
torch_mlp.add_module('out', out)
torch_mlp.add_module('ac_out', ac_out)
torch_mlp
Sequential(
  (fc1): Linear(in_features=2, out_features=8, bias=True)
  (ac1): ReLU()
  (fc2): Linear(in_features=8, out_features=6, bias=True)
  (ac2): ReLU()
  (out): Linear(in_features=6, out_features=1, bias=True)
  (ac_out): Sigmoid()
)
Hide the code
# 1. Loss function (Binary Cross Entropy)
loss_fn = nn.BCELoss() 

# 2. Optimizer (Adam is the scikit-learn default)
optimizer = torch.optim.Adam(torch_mlp.parameters(), lr=0.001)

# 3. Example Forward Pass
# X_train is a tensor of shape (batch_size, 2)
# outputs = torch_mlp(X_train)
Hide the code
from torch.utils.data import TensorDataset

# Convert to Tensors
train_ds = TensorDataset(XTR_tensors, YTR_tensors)

test_ds = TensorDataset(XTS_tensors, YTS_tensors)
Hide the code
from torch.utils.data import DataLoader
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=16, shuffle=False)

Let us get one tensor out of train_ds

Hide the code
X0, Y0 = train_ds[0]
X0, Y0
(tensor([0.2201, 0.7972]), tensor(1.))
Hide the code
Y0hat = torch_mlp(X0)
Y0hat, Y0hat.shape, Y0, Y0.shape
(tensor([0.4746], grad_fn=<SigmoidBackward0>),
 torch.Size([1]),
 tensor(1.),
 torch.Size([]))
Hide the code
Y0 = Y0.unsqueeze(-1)
print(Y0hat.shape, Y0.shape)
torch.Size([1]) torch.Size([1])
Hide the code
loss_fn(torch_mlp(X0), Y0)
tensor(0.7452, grad_fn=<BinaryCrossEntropyBackward0>)
Hide the code
# Move the model to MPS/GPU
torch_mlp.to(device)



# Then define the optimizer
optimizer = torch.optim.Adam(torch_mlp.parameters(), lr=0.001)
Hide the code
import datetime

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    
    start_training_time = datetime.datetime.now()
    # Loop over epochs
    for epoch in range(1, n_epochs + 1):  
        
        start_time = datetime.datetime.now()
        loss_train = 0.0
        
        # Loop over batches
        for x, y in train_loader:  
            
            x = x.to(device)
            y = y.to(device)

            # forward pass
            outputs = model(x)
          
            # compute loss
            loss = loss_fn(outputs, y.unsqueeze(-1))

            # reset gradients before backpropagation
            optimizer.zero_grad()
            
            # backpropagation
            loss.backward()
            
            # update parameters
            optimizer.step()

            # update training loss
            loss_train += loss.item()

        end_time = datetime.datetime.now()
        epoch_duration = (end_time - start_time).total_seconds()

        if epoch == 1 or epoch % 10 == 0:
            print('{} Epoch {}, Training loss {:.6f}, Time {:.2f}s'.format(
                datetime.datetime.now(), epoch,
                loss_train / len(train_loader), epoch_duration))
            
    print(f"Training took {end_time - start_training_time} seconds")
Hide the code
training_loop(
    n_epochs = 300,
    optimizer = optimizer,
    model = torch_mlp,
    loss_fn = loss_fn,
    train_loader = train_loader,
)
2026-02-25 12:04:22.690060 Epoch 1, Training loss 0.681838, Time 0.11s
2026-02-25 12:04:23.064979 Epoch 10, Training loss 0.376431, Time 0.04s
2026-02-25 12:04:23.483046 Epoch 20, Training loss 0.218785, Time 0.04s
2026-02-25 12:04:23.877127 Epoch 30, Training loss 0.149364, Time 0.04s
2026-02-25 12:04:24.287179 Epoch 40, Training loss 0.109049, Time 0.04s
2026-02-25 12:04:24.733654 Epoch 50, Training loss 0.082608, Time 0.06s
2026-02-25 12:04:25.181084 Epoch 60, Training loss 0.063948, Time 0.04s
2026-02-25 12:04:25.630506 Epoch 70, Training loss 0.050297, Time 0.04s
2026-02-25 12:04:26.028061 Epoch 80, Training loss 0.040016, Time 0.04s
2026-02-25 12:04:26.449601 Epoch 90, Training loss 0.032604, Time 0.04s
2026-02-25 12:04:26.900133 Epoch 100, Training loss 0.026199, Time 0.06s
2026-02-25 12:04:27.302513 Epoch 110, Training loss 0.021227, Time 0.04s
2026-02-25 12:04:27.708208 Epoch 120, Training loss 0.017491, Time 0.04s
2026-02-25 12:04:28.107713 Epoch 130, Training loss 0.014470, Time 0.04s
2026-02-25 12:04:28.516577 Epoch 140, Training loss 0.012402, Time 0.04s
2026-02-25 12:04:28.920798 Epoch 150, Training loss 0.010324, Time 0.04s
2026-02-25 12:04:29.318640 Epoch 160, Training loss 0.008623, Time 0.04s
2026-02-25 12:04:29.715852 Epoch 170, Training loss 0.007344, Time 0.04s
2026-02-25 12:04:30.105812 Epoch 180, Training loss 0.006205, Time 0.04s
2026-02-25 12:04:30.504336 Epoch 190, Training loss 0.005319, Time 0.04s
2026-02-25 12:04:30.907504 Epoch 200, Training loss 0.004632, Time 0.04s
2026-02-25 12:04:31.325320 Epoch 210, Training loss 0.004128, Time 0.04s
2026-02-25 12:04:31.728984 Epoch 220, Training loss 0.003573, Time 0.04s
2026-02-25 12:04:32.128630 Epoch 230, Training loss 0.003326, Time 0.04s
2026-02-25 12:04:32.524894 Epoch 240, Training loss 0.002968, Time 0.04s
2026-02-25 12:04:32.925884 Epoch 250, Training loss 0.003171, Time 0.04s
2026-02-25 12:04:33.322972 Epoch 260, Training loss 0.002311, Time 0.04s
2026-02-25 12:04:33.700443 Epoch 270, Training loss 0.002012, Time 0.03s
2026-02-25 12:04:34.077299 Epoch 280, Training loss 0.001929, Time 0.04s
2026-02-25 12:04:34.475937 Epoch 290, Training loss 0.001909, Time 0.04s
2026-02-25 12:04:34.875719 Epoch 300, Training loss 0.001845, Time 0.04s
Training took 0:00:12.290901 seconds
Hide the code
# Now let us get the predictions for the training and test sets, and evaluate the performance of the model. We will use the same metrics as before: confusion matrix and ROC curve.

torch_mlp.eval()  # Set the model to evaluation mode

# Get predictions for training set
with torch.no_grad():
    outputs_train = torch_mlp(XTR_tensors)
    preds_train = (outputs_train > 0.5) * 1.0  # Convert probabilities to binary predictions    

# Get predictions for test set
with torch.no_grad():
    outputs_test = torch_mlp(XTS_tensors)
    preds_test = (outputs_test > 0.5) * 1.0  # Convert probabilities to binary predictions)
Hide the code
preds_train[:10]
tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [1.]])
Hide the code
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
fig = plt.figure(constrained_layout=True, figsize=(6, 2))
spec = fig.add_gridspec(1, 3)
ax1 = fig.add_subplot(spec[0, 0]);ax1.set_title('Training'); ax1.grid(False)
ax2 = fig.add_subplot(spec[0, 2]);ax2.set_title('Test'); ax2.grid(False)
ConfusionMatrixDisplay.from_predictions(YTR, preds_train, cmap="Greens", colorbar=False, ax=ax1, labels=[1, 0])
ConfusionMatrixDisplay.from_predictions(YTS, preds_test, cmap="Greens", colorbar=False, ax=ax2, labels=[1, 0])
plt.suptitle("Confusion Matrices for "+ model_name)
plt.show(); 


A Multiclass Example

To illustrate this case we will train a MLP that uses the architecture in Figure 13-11 of (Glassner 2021) that we have seen before (reproduced below for convenience).

This MLP architecture can be used to classify the samples in the classical iris dataset. To see why recall that this dataset uses as inputs four morphological characteristics of the iris flowers, while the output factor Species has three levels.

Hide the code
iris = datasets.load_iris()
X = iris["data"]
X_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(X, columns=X_names)
Y = "species"
df[Y] = iris["target"] 
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

Therefore, it requires a MLP model with four inputs and three output neurons. Let us create the train test split and the datasets and variables that we will use. Note that we are scaling the input data.

Hide the code
inputs = X_names
output = Y

XTR, XTS, YTR, YTS = train_test_split(df[inputs], df[Y],
                                      test_size=0.2,  # percentage preserved as test data
                                      random_state=1, # seed for replication
                                      stratify = df[Y])   # Preserves distribution of y

iris_scaler = StandardScaler()
XTR = iris_scaler.fit_transform(XTR)
XTS = iris_scaler.transform(XTS)

dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS

Now we can define the model’s architecture and train it. Note that we only care about the number and size of the hidden layers. The classifier will look at the number of levels of the output and use the appropriate size and softmax function for the output layer.

Hide the code
mlp_multi = MLPClassifier(hidden_layer_sizes=(3, 2), activation="relu", max_iter=6000, random_state=2024)
mlp_multi.fit(XTR, YTR)

model = mlp_multi
model_name = "mlp_multi"

After the fit we can confirm the expected structure of the weights for this architecture:

Hide the code
[lyr.shape for lyr in model.coefs_]
[(4, 3), (3, 2), (2, 3)]

Now let us use the mlp_draw function to visualize the fitted network and weights (recall, bias terms are omitted).

Hide the code
mlp_draw(model=model, orientation="h", label_pos=1.3/7)

The score predictions of this MLP, as expected, appear as tuples of three values, the scores for each of the three iris species corresponding to the data point being classified. For example, the first flower in the test set gets these scores. The class is assigned using the highest score.

Hide the code
np.round(model.predict_proba(XTS)[0, :], 3)
array([0.   , 0.001, 0.999])
Hide the code
predicted_class = model.predict(XTS)[0]
iris["target_names"][predicted_class] 
np.str_('virginica')

And in fact you can see that this very simple MLP model achieves almost perfect prediction on the test set.

Hide the code
pd.crosstab(model.predict(XTS), YTS)
species 0 1 2
row_0
0 10 0 0
1 0 10 1
2 0 0 9

The main goal of this example is to illustrate that there is not much of a difference between binary and multiclass MLP models, so we move on to another Machine Learning task.

We keep the model and data to use them below.

Hide the code
iris_data = {'XTR':XTR, 'XTS':XTS, 'YTR':YTR, 'YTS':YTS, 
    'inputs':inputs, 'output':output, 'iris_scaler':iris_scaler}

iris_model = {'mlp_multi':mlp_multi}

Regression with Neural Networks

A Regression Example

We will create a synthetic dataset for a simple regression problem with just one numeric input, but whith a clearly non linear relation with the output.

Hide the code
rng = np.random.default_rng(2024)
n = 1000

X = np.linspace(start=0, stop=5, num = n)
Y = X/2 + np.sin(X) * np.cos(2 * X) + 0.3 * rng.normal(size = n)

df = pd.DataFrame({"X":X, "Y":Y})

fig, ax = plt.subplots(figsize= (8, 3))
# sns.lineplot(x = X, y = Y, ax=ax)
sns.scatterplot(data = df, x = "X", y = "Y", ax=ax, s=5)

Next we define the datasets and variables we are going to use:

Hide the code
inputs = ["X"]
output = "Y"

XTR, XTS, YTR, YTS = train_test_split(df[inputs], df[output],
                                      test_size=0.2,  # percentage preserved as test data
                                      random_state=1)
dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS

And we create a MLP model with two hidden layers. We are using a higher number of neurons here and we have switched to MLPRegressor. But note that again we do not need to describe the output layer in any way, sklearn uses the approprate values for a regression problem. Note also that we are scaling the inputs to get a better behavior in model fit. And in order to do that we use the pipeline framework we know.

Hide the code
mlp_reg = MLPRegressor(hidden_layer_sizes=(80, 20), activation="relu", random_state=2024, max_iter=2000)

reg_scaler = StandardScaler()
reg_scaler.set_output(transform="pandas")

mlp_reg_pipe = Pipeline(steps=[('scaler', reg_scaler), 
                           ('mlp', mlp_reg)]) 

mlp_reg_pipe.fit(XTR, YTR)

model = mlp_reg_pipe
model_name = "mlp_reg"
Hide the code
mlp_reg_data = {'XTR':XTR, 'XTS':XTS, 'YTR':YTR, 'YTS':YTS, 
    'inputs':inputs, 'output':output, 'reg_scaler':reg_scaler}

mlp_reg_model = {'mlp_reg':mlp_reg_pipe}

Let us know look at the model predictions. To visualize the result and get a qualitative measure of the model fit we plot the fitted values against the original data.

Hide the code
# Dataset for Training Predictions
dfTR_eval = dfTR.copy()
# Store the actual predictions
newCol = 'Y_'+ model_name +'_pred'; 
dfTR_eval[newCol] = model.predict(XTR)
Hide the code
fig, ax = plt.subplots(figsize= (8, 3))
sns.scatterplot(data=dfTR_eval, x = "X", y = "Y", color="r", ax=ax)
sns.scatterplot(data=dfTR_eval, x = "X", y = newCol, color="b", ax=ax)

The same kind of plot for the test set illustrates that the model is indeed learning this non linear signal.

Hide the code
# Dataset for Training Predictions
dfTS_eval = dfTS.copy()
# Store the actual predictions
newCol = 'Y_'+ model_name +'_pred'; 
dfTS_eval[newCol] = model.predict(XTS)
fig, ax = plt.subplots(figsize= (8, 3))
sns.scatterplot(data=dfTS_eval, x = "X", y = "Y", color="r", ax=ax)
sns.scatterplot(data=dfTS_eval, x = "X", y = newCol, color="b", ax=ax)

A plot of the network architecture to illustrate the limitations of this visualization approach.

Hide the code
mlp_draw(model["mlp"],  orientation="h", figsize=(16, 16), alpha=0.5, add_labels=False)

But remember that you can always explore the weights and bias terms through the model.

Exercise 004

How many parameters are there in this, our largest MLP so far?


Training MLPs through Gradient Descent and Backpropagation

A Summary of MLP Trainng

How is a MLP model trained? An admittedly short description of the process follows. The rest of this section is devoted to provide details about each of these steps (not necessarily in the order they appear below):

  • We choose a loss function, that measures the error in the model’s fit to the traing data. Our ideal training goal is to bring the loss as close to zero as possible, while at the same time keeping overfitting under control.
  • We choose initial values for the weights and bias terms of the model. Spoiler: random choosing is not a good idea!
  • This and the next step are performed iteratively until some stopping criterion is met. We use the training data or fractions of it called batches to get a prediction of the model and therefore we get an evaluation of the loss function for the current choice of the model parameters. This step is called the forward pass because the information in the network flows forward (following the arrows).
  • Next we use the values collected in the forward pass and the backpropagation algorithm to compute the gradient of the loss function with respect the parameters in each one of the network layers. With this gradient and some optimization method like gradient descent and its relatives we uodate all the weights and bias of the model.

Loss Functions

The loss function measures the error makes in its predictions. And, as you already know, we use different expressions of the error depending on the nature of the model’s predictions. That is, depending on the kind of Supervised Machine Learning problem.

For regression the most commonly used loss functions are the \({\cal L}_{MAE}\) and \({\cal L}_{MSE}\), defined as: \[ {\cal L}_{MAE} = \dfrac{1}{n}\sum_{i = 1}^n|\hat y_i - y_i|,\qquad {\cal L}_{MSE} = \dfrac{1}{2n}\sum_{i = 1}^n (\hat y_i - y_i)^2 \] Here \(y_i\) is a true/observed value and \(\hat y_i\) is the corresponding prediction of the model.

For classification (either binary or multiclass) the most widely used loss function is the cross-entropy, defined as: \[{\cal L}_{ent} = -\sum_{i = 1}^n\sum_{j=1}^m y_{ij}\log\hat y_{ij}\] where \(y_{i1}, y_{i2}, \ldots, y_{im}\) is the one hot encoding of the \(i-th\) input and \(\hat y_{i1}, \hat y_{i2}, \ldots, \hat y_{im}\) the scores assigned by the model, for each of the \(m\) levels/classes of the output factor. For the binary case this boils down to \[{\cal L}_{ent} = -\sum_{i = 1}^n (y_j\log\hat y_j + (1 - y_j)\log(1 - \hat y_j))\]

By the way, MLPRegressor always uses what we call \({\cal L}_{MSE}\), while MLPClassifier always uses the cross-entropy \({\cal L}_{ent}\). In other libraries there is more freedom to choose other loss functions or even define your own, taylored to the problem (e.g. to use different weights for the classes).

All these loss functions share some properties that are also critical for their use with the optimization algorithms we use while fitting the models: 1. They are differentiable functions with respect to their arguments. 2. They are sums over a set of observations. And therefore they can be estimated using any number of observations.

Let us emphasize this idea by playing with the code. Take the training data and fitted model of regression example:

Hide the code
YTR = mlp_reg_data["YTR"]
XTR = mlp_reg_data["XTR"]
model = mlp_reg_model["mlp_reg"]

Now let us use this data and model to get the predictions for training, that correspond to what we called \(\hat y_i\) above. We show the first ones.

Hide the code
YTR_pred = model.predict(XTR)
YTR_pred[:10]
array([ 0.26670396,  3.69201112,  3.6049855 ,  0.37045803,  1.58530131,
        1.54973299,  0.36139089,  0.0949725 , -0.18409113,  3.38742146])

These YTR_pred predictions are the result of the forward pass of the entire training set through the network, using the present weights and bias terms of the model. And now we can use them to get the value of the loss function (properly speaking, an estimate of the value of the loss function):

Hide the code
loss = (1/2) * ((YTR_pred - YTR)**2).mean()
loss
np.float64(0.04931940119577732)

But if we change the size and/or values of the inputs, creating for example a new shorter array:

Hide the code
Xnew = np.random.default_rng(2024).normal(size = 10).reshape(-1, 1)
Xnew
array([[ 1.02885687],
       [ 1.64192004],
       [ 1.14671953],
       [-0.97317952],
       [-1.3928001 ],
       [ 0.06719636],
       [ 0.86135092],
       [ 0.5091868 ],
       [ 1.81028557],
       [ 0.75084347]])

and the associated output values:

Hide the code
Ynew = Xnew/2 + np.sin(Xnew) * np.cos(2 * Xnew) + 0.3 * rng.normal(size = Xnew.shape[0])

Then we can run them through the betwork (forward pass) and get a prediction:

Hide the code
Ynew_pred = model.predict(Xnew)
Ynew_pred[:10]
/opt/conda/envs/mlmiin/lib/python3.11/site-packages/sklearn/utils/validation.py:2749: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
array([ 0.14799548, -0.2376762 ,  0.05769325,  0.31106307,  0.2932709 ,
        0.35958583,  0.27633267,  0.38818285,  0.04876662,  0.36026116])

Then we can get a new value of the loss:

Hide the code
loss_new = (1/2) * ((Ynew_pred - Ynew)**2).mean()
loss_new
np.float64(0.09540873679602235)

The neural network does not care if this values belong to the training set, or even about how many they are. But we can take a step further to better understand training. Below we see the 20 weights in the (fitted to training data) outer layer:

Hide the code
model["mlp"].coefs_[2]
outer_weights = model["mlp"].coefs_[2]
outer_weights
array([[-5.25548563e-01],
       [-1.82610993e-01],
       [-2.07258451e-02],
       [ 5.01400632e-01],
       [-2.10009181e-01],
       [ 1.33174938e-02],
       [-4.89055375e-01],
       [ 2.01155719e-02],
       [ 2.95700964e-01],
       [-1.40096016e-01],
       [-2.77540400e-01],
       [ 1.99960980e-09],
       [-3.39275348e-07],
       [-3.11143961e-01],
       [ 4.89647944e-01],
       [ 5.74968143e-01],
       [-4.48779105e-01],
       [ 4.07173610e-01],
       [ 1.26540951e-06],
       [-4.77446082e-01]])

Let us modify them, replacing them with arbitrary uniform values:

Hide the code
model["mlp"].coefs_[2] = np.random.default_rng(2024).uniform(low = 0, high = 0.1, size = 20).reshape(-1, 1)
model["mlp"].coefs_[2] 
array([[0.06758313],
       [0.02143232],
       [0.0309452 ],
       [0.07994661],
       [0.09958021],
       [0.01422318],
       [0.00787255],
       [0.01808238],
       [0.03596469],
       [0.01696192],
       [0.05887593],
       [0.06168075],
       [0.01053857],
       [0.05657311],
       [0.00046296],
       [0.04651192],
       [0.09756222],
       [0.07994284],
       [0.05968224],
       [0.03253497]])

Now we have modified the neural network itself. In fact, in this case we have untrained it (effectively ruining it). To see this, let us run again the training set through this altered neural network to get the predictions. You can compare them with the predictions of the original fitted and yet untampered network.

Hide the code
YTR_pred = model.predict(XTR)
YTR_pred[:10]
array([0.39286751, 0.56660871, 0.55717158, 0.47475645, 0.3529616 ,
       0.34419725, 0.48586723, 0.39680396, 0.4032007 , 0.53357875])

Accordingly, the loss function value is different and much worse!

Hide the code
loss = (1/2) * ((YTR_pred - YTR)**2).mean()
loss
np.float64(0.9872144488030443)

Before we continue let us restore the model to its pristine fitted state:

Hide the code
model["mlp"].coefs_[2] = outer_weights
The Loss is a Function of the Weights

It is critically important to understand this examples: the loss function depends on the weights and bias terms. And we can try to train the network by modifying those weights and biases to get lower values of loss for the training data.

Differently: when you look at the previous expressions for the loss you see \(y\) and \(\hat y\). The \(y\) values belong to the dataset and there is nothing trainable about them. But the \(\hat y\) are the result of the forward pass of the training data through the neural network. And that means that they are computed using all the weights and bias terms (parameters). Weights and biases are the trainable components of the network.

Let us use the models that we have fitted to illustrate the role played by the loss function. For example, this is the graph of the values of the loss function for the iris dataset as the training of the neural network proceeds:

Hide the code
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(1, 3, figsize=(16, 3))
    axs[0].set_title("Binary Classification (make_circles) Loss curve")
    sns.lineplot(make_circles_model['mlp_binary'].loss_curve_, ax=axs[0])
    axs[1].set_title("Multilevel Classification (iris) Loss curve")
    sns.lineplot(iris_model['mlp_multi'].loss_curve_, ax=axs[1])
    axs[2].set_title("Regression Loss curve")
    sns.lineplot(mlp_reg_model['mlp_reg']['mlp'].loss_curve_, ax=axs[2])

The three pictures illustrate that the fitting process proceeds by searching for smaller values of the loss.


Regularization

Keep in mind that our goal when working with loss functions is to get the minimum value, as they are error functions. However, we need to fight overfitting while doing this. A large neural network (in terms of trainable paremeters) is a very flexible model. That is why these loss functions are often combined with regularization techniques similar to the Ridge regression and Lasso that we have seen for linear models. For example the MLPClassifier and MLPRegressor of sklearn both use an argument alpha that controls the regularization using a Ridge-like term added to the loss function.

Hide the code
alphas = [10**k for k in range(-5, 5, 2)]

loss_curves = []
test_scores = []

for i in range(len(alphas)):
    mlp_reg_batch = MLPRegressor(hidden_layer_sizes=(80, 20), alpha=alphas[i],
        activation="relu", random_state=2024, max_iter=2000)

    reg_scaler_batch = StandardScaler()
    reg_scaler_batch.set_output(transform="pandas")

    mlp_reg_batch_pipe = Pipeline(steps=[('scaler', reg_scaler_batch), 
                           ('mlp', mlp_reg_batch)]) 

    mlp_reg_batch_pipe.fit(XTR, YTR)
    
    test_scores.append(mlp_reg_batch_pipe.score(XTS, YTS))

    model_batch = mlp_reg_batch_pipe
    loss_curves.append(model_batch['mlp'].loss_curve_)
Hide the code
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(len(alphas), 1, figsize=(6, 12))
    for i in range(len(alphas)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"""Loss curve for regularization alpha = {alphas[i]}\n 
                            Final loss {np.round(loss_curves[i][-1], 3)}, 
                            Test Score {np.round(test_scores[i], 3)}""")
fig.tight_layout()
    
    

Dropout

Dropout is another form of regularization in which we randomly disconnect some neurons on one layer when computing the activations. This is similar in spirit to what we did in random forests: restricting the information available for the model helps prevent underfitting and forces the model to focus on the overall signal in the data instead of getting lost in details.

This dropout methos id only used during training. When making predictions, all neurons in the network and their associated weights are used. Dropout is often implemented as an extra layer placed in between orinary layers, that takes care of this switching off of some neurons.


Gradient Descent

In a typical first year Calculus students learn that if a function \(f\) is differentiable, then the graph of the function is steepest in the direction of the the gradient. Therefore, if you are aiming for smaller values of \(f\), eventually reaching a minimum, then a seemingly sensible idea is to move in the opposite direction of the gradient.

This can be made into an iterative process in discrete steps as follows: 1. If you are located at point \(\bar p\), find the gradient at that point \(\nabla f(\bar p)\). We will explain below how to do this using backpropagation, 2. Take a step of a certain length \(\gamma\) in the direction that is opposite to the gradient. That is, your new position is \[\bar q = \bar p - \gamma\nabla f(\bar p)\] 3. Repeat the operation at \(\bar q\): find \(\nabla f(\bar q)\), use it to take a \(\gamma\) sized step, and so on.

The \(\gamma\) parameter is called the learning rate.

This iterative process is the basic idea of Gradient descent numerical optimization algorithms. You can visualize and experiment with the idea using this construction by Ben Frederiksson, which also contains visualizations for some other optimization methods. Or you can use our own GeoGebra visualization.

By playing with this visualizations you will discover that the choice of the learning parameter is critical: a too small value and the method may never converge. A too large value and the method may overshoot the minimum, possibly not converging either. There are further difficulties: the initial point of the oterations plays an important part in the success or failure of the method. Besides, if the function has several local minimum values the method may converge to one such local minimum and fail to find the global minimum value.

The appeal of gradient descent is obvious, however, and it is based in a very simple yet powerful idea: the only critical skill it requires is the ability to compute the gradient of the function wherever you are. And by “wherever you are” we mean a point in the space of parameters (weights and biases) of the network.

We have seen before how running the training set through the forward pass of the network will give us the value of the loss function. That loss value depends, as we keep repeating, on the current selection of weights and biases for the network. The entire selection of weights and biases is the \(\bar p\) in the description of gradient descent. Assume for a moment that we already know how to compute the gradient \(\nabla\cal L(\bar p)\). Then we apply the equation in the second step of gradient descent to change each weight and bias in the network. Then we run the training set through the network again, get a new gradient and so on. That is the essence of how we train the network.


Batches and Stochastic Gradient Descent, Epochs.

The above discussion assumes that we use the whole training set in each iteration of gradient descent. But we have also seen that we can use any subset of the input data to get an estimate of the value of the loss function. And we will see that below that we can also use that smaller part of the data to get an estimate of the gradient \(\nabla\cal L(\bar p)\). Of course, an estimate based in a smaller data sample is expected to be a lower quality estimate. So, why would we do that?

First, because selecting a smaller subset speeds up computation. Second, because using a theoretically worse version of the gradient has empirically proved to help the training process. Possibly by escaping local minima that could act like a trap for the gradient descent iterations.

This method of using smaller subsets of the training data is called stochastic gradient descent (because each gradient we compute will be a random or stochastic estimate of the gradient for the whole training set). And it is organized as follows: 1. We shuffle (randomly permute) the training data. 2. We divide the shuffled data into batches (or minibatches, depending on the authors). They are usually sized as a power of 2 to benefit from the computing architecture of our computers. This implies that there may be a final batch whose size is smaller. For example, with 100 trainig data and 16-sized minibatches you will end up with 6 full sized minibatches and a last one with only 4 elements, since \(6\cdot 16 + 4 = 100\). 3. Each minibatch is used in turn to make a forward pass, compute the loss gradient with backpropagation and uodate the network parameters. The result of this set of operations for all minibatches is called an epoch of training. 4. Then we start a new epoch and repet the above three steps. Training proceeds this way until some stopping criterion is met, concerning the changes on the loss function or a predefined maximal number of epochs.

In scikit-learn we can control this with the batch_size argument of both MLPClassifier and MLPRegressor.

Hide the code
batch_sizes = [2, 64, 256, XTR.shape[0]]

loss_curves = []

for i in range(len(batch_sizes)):
    mlp_reg_batch = MLPRegressor(hidden_layer_sizes=(80, 20), batch_size= batch_sizes[i],
        activation="relu", random_state=2024, max_iter=2000)

    reg_scaler_batch = StandardScaler()
    reg_scaler_batch.set_output(transform="pandas")

    mlp_reg_batch_pipe = Pipeline(steps=[('scaler', reg_scaler_batch), 
                           ('mlp', mlp_reg_batch)]) 

    mlp_reg_batch_pipe.fit(XTR, YTR)

    model_batch = mlp_reg_batch_pipe
    loss_curves.append(model_batch['mlp'].loss_curve_)
Hide the code
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(1, len(batch_sizes), figsize=(16, 3))
    for i in range(len(batch_sizes)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"Loss curve for batch size {batch_sizes[i]}\n Final loss {np.round(loss_curves[i][-1], 3)}")
    
    

The results show that, all other things equal, using the whole training set (last loss curve) is not a good idea. Apparently it fails to reach the minimum. Note also that the first curve is much noisier, due to the fact that the estimates of the gradient are much more unstable.


Choosing the Learning Rate. Momentum.

Choosing the learning rate is often a critical decision when training a neural network. Recall that the learning rate controls the size of the step we take when performing gradient descent. On the one hand, if we take too large a step we risk jumping over the minima or bouncing endlessly without finding a good fit of the model. If, on the other hand, we choose a step size that is too small, we risk getting stuck in a plateau of the loss function and never reaching a minimum.

A related idea is that of momentum. This physically inspired method adds a gradient memory to the gradient descent strategy: to take the next step, instead of using only the latest value of the gradient, we add a fraction of the previous update (which itself accumulates past gradients). This has been shown to improve the convergence of the method to better (smaller) loss values. A minimal sketch follows.
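The sketch below shows one common formulation of momentum in NumPy: the velocity v keeps an exponentially decaying memory of past gradients. Scikit-learn's internal implementation may differ in details; this is only meant to convey the idea.

Hide the code
import numpy as np

def sgd_momentum(grad_loss, p0, learning_rate=0.01, momentum=0.9, n_steps=1000):
    p = np.asarray(p0, dtype=float)
    v = np.zeros_like(p)  # velocity: the "gradient memory"
    for _ in range(n_steps):
        # Each step mixes a fraction of the previous update with the new gradient.
        v = momentum * v - learning_rate * grad_loss(p)
        p = p + v
    return p

# Same toy problem as before: minimize L(p) = ||p||^2.
print(sgd_momentum(lambda p: 2 * p, p0=[3.0, -2.0]))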

Let us use the example of a neural network for regression that we have trained and see the impact of different learning rates. First we turn off momentum entirely.

Technical note: we have switched to solver = 'sgd' so that we can keep a constant value of the learning rate during the whole training process. We will return to this below.

Hide the code
learning_rates = [0.00001, 0.005, 0.01, 0.05, 0.1, 0.5]

loss_curves = []
train_scores = []

for learning_rate in learning_rates:
    # Constant learning rate, plain SGD, momentum switched off.
    mlp_reg_lr = MLPRegressor(hidden_layer_sizes=(80, 20), learning_rate='constant',
                              solver='sgd', learning_rate_init=learning_rate,
                              momentum=0, activation="relu", random_state=2024,
                              max_iter=2000)

    reg_scaler_lr = StandardScaler()
    reg_scaler_lr.set_output(transform="pandas")

    mlp_reg_lr_pipe = Pipeline(steps=[('scaler', reg_scaler_lr),
                                      ('mlp', mlp_reg_lr)])

    mlp_reg_lr_pipe.fit(XTR, YTR)

    # Cross-validated training score (negative MSE, higher is better).
    train_scores.append(cross_val_score(mlp_reg_lr_pipe, XTR, YTR,
                                        scoring='neg_mean_squared_error').mean())

    loss_curves.append(mlp_reg_lr_pipe['mlp'].loss_curve_)
/opt/conda/envs/mlmiin/lib/python3.11/site-packages/sklearn/neural_network/_multilayer_perceptron.py:781: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2000) reached and the optimization hasn't converged yet.
  warnings.warn(
(the same ConvergenceWarning is emitted once for each of the six learning rates)

Now let us see the loss curves and validation scores of the different learning rates.

Hide the code
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(len(learning_rates), 1, figsize=(6, 12))
    for i in range(len(learning_rates)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"""Loss curve for learning rate = {learning_rates[i]}\n 
                            Final loss {np.round(loss_curves[i][-1], 3)}, 
                            Training Score (negMSE) {np.round(train_scores[i], 3)}""")
fig.tight_layout()
    
    

The result of the experiment confirms what we said before. The smallest learning rate gives a very bad result: it fails to reach any interesting loss value after 2000 iterations, because we are simply moving too slowly. At the other extreme, the largest learning rate produces a bouncing loss function that does not provide a good fit. Meanwhile, some intermediate values of the learning rate do much better.

Let us now bring back momentum; it is on by default, so we only need to remove the momentum=0 option and rerun the code.

Hide the code
learning_rates = [0.00001, 0.005, 0.01, 0.05, 0.1, 0.5]

loss_curves = []
train_scores = []

for learning_rate in learning_rates:
    # Same experiment, but momentum is back to its default value (0.9).
    mlp_reg_lr = MLPRegressor(hidden_layer_sizes=(80, 20), learning_rate='constant',
                              solver='sgd', learning_rate_init=learning_rate,
                              activation="relu", random_state=2024, max_iter=2000)

    reg_scaler_lr = StandardScaler()
    reg_scaler_lr.set_output(transform="pandas")

    mlp_reg_lr_pipe = Pipeline(steps=[('scaler', reg_scaler_lr),
                                      ('mlp', mlp_reg_lr)])

    mlp_reg_lr_pipe.fit(XTR, YTR)

    # Cross-validated training score (negative MSE, higher is better).
    train_scores.append(cross_val_score(mlp_reg_lr_pipe, XTR, YTR,
                                        scoring='neg_mean_squared_error').mean())

    loss_curves.append(mlp_reg_lr_pipe['mlp'].loss_curve_)
Hide the code
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(len(learning_rates), 1, figsize=(6, 12))
    for i in range(len(learning_rates)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"""Loss curve for learning rate = {learning_rates[i]}\n 
                            Final loss {np.round(loss_curves[i][-1], 3)}, 
                            Training Score (negMSE) {np.round(train_scores[i], 3)}""")
fig.tight_layout()
    
    

The impact of momentum is a global improvement of the behavior across all learning rates. The ordering is preserved in this case: the best learning rates are still similar, but the method converges faster and we get better final values of the loss.


Optimizers

The term optimizers describes a set of tools designed to speed up learning by improving the efficiency of gradient descent (see Chapter 15 of (Glassner 2021)). For example, instead of keeping a constant learning rate, we can use a learning rate schedule. A natural idea is to take longer steps at the beginning, trying to move swiftly across possible plateaus in the loss function, and then to slow down as we come closer to a minimum.

This strategy can easily be accomplished by selecting a decay rate \(s\) smaller than 1 and multiplying the learning rate \(\lambda\) by this decay rate after each step. That means that instead of a constant learning rate we now use an exponentially decaying sequence of learning rates, decreasing to zero: \[ \lambda, s\,\lambda, s^2\,\lambda,\ldots, s^k\,\lambda,\ldots \rightarrow 0 \]
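A two-line sketch of this exponential schedule (the values of \(\lambda\) and \(s\) are arbitrary, just for illustration):

Hide the code
lam, s = 0.1, 0.9  # initial learning rate and decay rate

# The learning rate used at step k is s^k * lam, decreasing to zero.
schedule = [lam * s**k for k in range(6)]
print(schedule)  # [0.1, 0.09, 0.081, 0.0729, 0.06561, 0.059049]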

The idea behind this is illustrated by this Figure 15-16 from (Glassner 2021).

There are many possible variants based on this basic idea. We can apply the learning rate update after each epoch instead of after every gradient descent step. Or we can monitor the loss and change the learning rate using that information (e.g. slowing down if we detect bouncing). The idea of momentum can be considered an optimizer as well. There are indeed improved versions of momentum, notably Nesterov's momentum: it combines the momentum from past gradients with an anticipated gradient value from the (near) future to obtain a corrected update that usually performs better than simple momentum. In scikit-learn you can use it when you (manually) select stochastic gradient descent, as shown below.
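For instance, with solver='sgd' the scikit-learn MLP estimators expose a nesterovs_momentum flag, which is on by default whenever momentum > 0. A sketch, reusing the regression setup from the experiments above (XTR, YTR as before):

Hide the code
from sklearn.neural_network import MLPRegressor

# Nesterov's momentum is only available with the 'sgd' solver.
mlp_nesterov = MLPRegressor(hidden_layer_sizes=(80, 20), solver='sgd',
                            learning_rate_init=0.05, momentum=0.9,
                            nesterovs_momentum=True,  # the default when momentum > 0
                            activation="relu", random_state=2024, max_iter=2000)
# mlp_nesterov.fit(XTR, YTR)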

Another family of optimizer techniques adapts the learning rate to each individual parameter (weight and bias) of the network. The most popular of these adaptive methods is called Adam (from Adaptive Moment Estimation). You may think of Adam and related methods as finding a decay rate tailored to each individual parameter, one which takes into account the history of changes that the parameter has undergone, weighting them by how recent they are. The MLP functions of scikit-learn use Adam by default to implement gradient descent.
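To convey the idea, here is a minimal NumPy sketch of the standard Adam update: the per-parameter running moments m and v play the role of the tailored, history-aware decay rates. Scikit-learn's implementation adds further practical details.

Hide the code
import numpy as np

def adam(grad_loss, p0, learning_rate=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, n_steps=5000):
    p = np.asarray(p0, dtype=float)
    m = np.zeros_like(p)  # running mean of gradients (first moment)
    v = np.zeros_like(p)  # running mean of squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad_loss(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # correct the bias of the zero initialization
        v_hat = v / (1 - beta2**t)
        # Each parameter gets its own effective step size.
        p = p - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return p

# Same toy problem: minimize L(p) = ||p||^2; the result is close to (0, 0).
print(adam(lambda p: 2 * p, p0=[3.0, -2.0]))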


Backpropagation

If gradient descent is the engine of neural networks, backpropagation (backprop for short) provides the fuel: that is, the gradient. It is essentially a clever application of the chain rule from Calculus, together with some dynamic programming ideas that optimize the computation by avoiding repeatedly computing the same quantities.

Here we will focus on the chain rule part of backprop, which is conceptually the cornerstone of the algorithm. To illustrate the idea we consider the neural network in our regression example, pictured below. Recall that the loss function \(\cal L\) is a function of the weights and bias terms of the network. And let us suppose that we want to compute the following component of the gradient: \[\dfrac{\partial \cal L}{\partial w^{(1)}_{04}}\] where \(w^{(1)}_{04}\) is the weight for the green colored edge of the graph.

To understand the notation consider the picture below. We are closely following here the notation in Chapter 10 of (James et al. 2023):

  • The weights in the \(r\)-th hidden layer are called \(w^{(r)}_{ij}\). The subscripts \(i\) and \(j\) indicate the numbers of the neurons they connect.
  • For the special case of the input layer, \(i\) indicates the input number. Thus, the weight \(w^{(1)}_{04}\) we are considering connects the input \(X_0\) to the fourth neuron in the first hidden layer.
  • The bias term of the \(j\)-th neuron in the \(r\)-th hidden layer is denoted by \(w^{(r)}_{Bj}\).
  • The \(j\)-th neuron in the \(r\)-th layer computes the activation \[A^{(r)}_{j} = h^{(r)}(z^{(r)}_{j}) = h^{(r)}\left(w^{(r)}_{Bj} +\underbrace{w^{(r)}_{0j}A^{(r-1)}_{0} + w^{(r)}_{1j}A^{(r-1)}_{1} + \cdots}_{\text{sum over all neurons of the preceding layer}} \right)\] where \(h^{(r)}\) is the common activation function of the \(r\)-th hidden layer (e.g. relu).
    For the special case of the first hidden layer (\(r = 1\)), the activations \(A^{(0)}_{k}\) must be replaced with the \(k\)-th input value.
  • For the output layer we use the superscript \((out)\). Thus \(h^{(out)}\) is the output layer activation function (usually the identity in regression), and \(w^{(out)}_{rs}\) indicates a weight in the output layer. In regression there is only one neuron in the output layer, so in principle the second subscript could be omitted, but we keep it for greater generality.

What do the red and blue edges represent? In this regression problem, where the loss is \[{\cal L}_{MSE} = \dfrac{1}{2n}\sum_{i = 1}^n (\hat y_i - y_i)^2\] the predicted value \(\hat y_i\) is computed using the activations in the second hidden layer and the weights \(w^{(out)}_{rs}\) of the output layer. These weights are the rightmost group of colored edges.

But then we must realize that \(\hat y_i\) has been obtained from the activations of the second hidden layer, and each neuron of the second hidden layer is in turn computed from activations coming from the first hidden layer. The second group of red edges identifies how the activation \(A^{(1)}_{4}\) (first hidden layer, neuron number 4) is used as input in each of the neurons of the second hidden layer. And we care about neuron number 4 because the (green edge) weight \(w^{(1)}_{04}\) is one of the inputs of that neuron (and only of that neuron).

To apply the chain rule we need to think of all the composition paths that connect the loss function value to \(w^{(1)}_{04}\). Now you can see that this is what the red edges represent. We must compute a product of derivatives along each one of those paths and then add up the results. The blue colored path is a particular example of one such composition path. Let us follow the chain of derivatives of this blue path in the backward direction of the algorithm.

We can begin by computing \[ \dfrac{\partial \cal L}{\partial \hat y_i} = \dfrac{1}{n}(\hat y_i - y_i) \] Recall that \(\hat y_i\) is the prediction that the network outputs for the input values \((X_{i0}, X_{i1})\).
Now \[\hat y_i = h^{(out)}(z^{(out)}) = h^{(out)}\left( w^{(out)}_{B} + w^{(out)}_{0}A^{(2)}_0 + w^{(out)}_{1}A^{(2)}_1 +\cdots + \color{blue}w^{(out)}_{3}A^{(2)}_3\color{black} + \cdots + w^{(out)}_{5}A^{(2)}_5 \right)\] and so the next derivative in the blue chain is: \[\dfrac{\partial \hat y_i}{\partial A^{(2)}_3} = (h^{(out)})'(z^{(out)})\cdot w^{(out)}_{3}\]

We have now reached \(A^{(2)}_3\), with \[ A^{(2)}_3 = h^{(2)}(z^{(2)}_3) = h^{(2)}\left( w^{(2)}_{B3} + w^{(2)}_{03}A^{(1)}_{0} + w^{(2)}_{13}A^{(1)}_{1} +\cdots + \color{blue}w^{(2)}_{43}A^{(1)}_4\color{black} + \cdots + w^{(2)}_{73}A^{(1)}_7 \right) \] The next derivative in our path is therefore: \[\dfrac{\partial A^{(2)}_3}{\partial A^{(1)}_4} = (h^{(2)})'(z^{(2)}_3)\cdot w^{(2)}_{43}\]

Finally (in this really simple neural network) we reach \(A^{(1)}_4\), with \[ A^{(1)}_4 = h^{(1)}(z^{(1)}_4) = h^{(1)}\left( w^{(1)}_{B4} + \color{green}w^{(1)}_{04}X_{i0}\color{black} + w^{(1)}_{14}X_{i1} \right) \] The green color highlights that we have reached the desired weight \(w^{(1)}_{04}\), and so the final derivative in the chain is: \[\dfrac{\partial A^{(1)}_4}{\partial w^{(1)}_{04}} = (h^{(1)})'(z^{(1)}_4)\cdot X_{i0}\]

Putting all the pieces together, the product of derivatives along the blue path is: \[\dfrac{\partial \cal L}{\partial \hat y_i}\cdot(h^{(out)})'(z^{(out)})\cdot w^{(out)}_{3}\cdot (h^{(2)})'(z^{(2)}_3)\cdot w^{(2)}_{43}\cdot (h^{(1)})'(z^{(1)}_4)\cdot X_{i0}\] The derivative \[\dfrac{\partial \cal L}{\partial w^{(1)}_{04}}\] is a sum of six products like this one, one for each red or blue path in the picture. Remember that the forward pass has provided us with all the weights and all the affine combinations \(z^{(r)}_j\). In particular, we know all the values that appear in the above product, and so we can compute the gradient.

The above construction of the gradient may seem intimidating at first, but it should be clear that we can use matrix computations to organize the chain rule and parallelize it. In many cases it is convenient to think in terms of higher dimensional stacks of arrays, so-called tensors. That is the origin of the name of one of the first and most popular deep learning libraries, TensorFlow.
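The following self-contained NumPy sketch illustrates both points on a toy 2-2-2-1 network (with tanh hidden activations for smoothness, not the relu of our experiments, and not the network in the figure): the backward pass organizes the chain rule as a sequence of matrix products, and a finite-difference quotient confirms the resulting derivative for the analogue of \(w^{(1)}_{04}\) on this smaller network.

Hide the code
import numpy as np

rng = np.random.default_rng(2024)

# Toy regression network: 2 inputs -> 2 -> 2 -> 1 output,
# tanh in the hidden layers, identity in the output layer.
sizes = [2, 2, 2, 1]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]
x, y = np.array([0.5, -1.0]), np.array([2.0])

def loss(ws):
    a = x
    for k, (W, b) in enumerate(zip(ws, biases)):
        z = a @ W + b
        a = z if k == len(ws) - 1 else np.tanh(z)
    return 0.5 * np.sum((a - y) ** 2)

# Forward pass, keeping the affine combinations z needed by the chain rule.
a, zs = x, []
for k, (W, b) in enumerate(zip(weights, biases)):
    z = a @ W + b
    zs.append(z)
    a = z if k == len(weights) - 1 else np.tanh(z)

# Backward pass: delta holds dL/dz for the current layer, built as a
# sequence of matrix products (all chain-rule paths summed at once).
delta = a - y                                   # identity output, MSE loss
for k in range(len(weights) - 1, 0, -1):
    delta = (delta @ weights[k].T) * (1 - np.tanh(zs[k - 1]) ** 2)

# dL/dw for the weight joining input 0 to neuron 1 of the first hidden layer.
backprop_grad = x[0] * delta[1]

# Finite-difference check of the same derivative.
eps = 1e-6
w_plus = [W.copy() for W in weights]; w_plus[0][0, 1] += eps
w_minus = [W.copy() for W in weights]; w_minus[0][0, 1] -= eps
fd_grad = (loss(w_plus) - loss(w_minus)) / (2 * eps)

print(backprop_grad, fd_grad)  # the two values should agree closely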

The graph below, from a video at 3Blue1Brown, conveys the idea of the backpropagation algorithm as a backward flow of tensor information used to compute the gradient.


Recommended reading:

See References section at the end for details.

We also recommend the Neural Networks section of the 3Blue1Brown website.

References

Fitch, Frederic B. 1944. “McCulloch, Warren S., and Pitts, Walter. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, pp. 115–133.”
Glassner, Andrew. 2021. Deep Learning: A Visual Approach. No Starch Press. https://nostarch.com/deep-learning-visual-approach.
Hecht-Nielsen, Robert. 1990. Neurocomputing. Addison-Wesley.
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1990. “Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks.” Neural Networks 3 (5): 551–60. https://doi.org/10.1016/0893-6080(90)90005-6.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning with Applications in Python. Springer International Publishing. https://doi.org/10.1007/978-3-031-38747-0.